Feature Manipulation in Pandas

Here let's look at a different dataset that will allow us to really dive into some meaningful visualizations. This dataset is publicly available, and it is also part of a Kaggle competition.

You can get the data from here: https://www.kaggle.com/c/titanic-gettingStarted or you can use the code below to load the data from GitHub.

There are lots of IPython notebooks for looking at the Titanic data. Check them out and see if you like any better than this one!

When going through visualization options, I recommend the following steps:

Look at various high level plotting libraries like:

Adding Dependencies (for Jupyter Lab)

Loading the Titanic Data for Example Visualizations

Questions we might want to ask:

Grouping the Data
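As a minimal sketch of grouping in Pandas, the example below computes survival rates per passenger class. The rows are made up for illustration; only the column names (`Survived`, `Pclass`, `Sex`) follow the Kaggle Titanic file, and the choice of grouping keys is an assumption rather than the notebook's original code.

```python
import pandas as pd

# Tiny made-up sample using the Kaggle Titanic column names (not real passengers).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 3, 1, 2],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
})

# Split by class, then average the 0/1 Survived flag:
# the mean of a 0/1 column is the fraction that survived.
survival_by_class = df.groupby("Pclass")["Survived"].mean()

# Group by two keys at once to get the size and survival rate of each subgroup.
summary = df.groupby(["Pclass", "Sex"])["Survived"].agg(["count", "mean"])

print(survival_by_class)
print(summary)
```

Grouping by more than one key returns a result with a hierarchical index, which is convenient to pass on to the plotting methods later.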

Class Exercise 📝:


Cleaning the Dataset

Let's start by visualizing the missing data in this dataset. We will use a visualization library called missingno, which has many types of visuals for showing where a dataframe contains NaNs. This is a great tool for looking at NaN values and deciding how we might go about filling them in.

I particularly like the matrix visualization, but there are many more to explore:
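Before drawing anything, it can help to look at the same information numerically. The sketch below uses only Pandas on a made-up frame (the values are illustrative, not real passengers) to count NaNs per column. With the missingno package installed, calling `missingno.matrix(df)` on the same frame would give the visual counterpart.

```python
import numpy as np
import pandas as pd

# Made-up rows with Titanic-style column names; NaN marks a missing entry.
df = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Fare":  [7.25, 71.28, 7.92, 53.10, 8.05],
})

# isna() gives a True/False mask; summing counts the missing entries,
# and averaging gives the fraction missing in each column.
missing_count = df.isna().sum()
missing_fraction = df.isna().mean()

print(pd.DataFrame({"missing": missing_count, "fraction": missing_fraction}))
```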

Plot Type One: Filter Bar

Imputation Techniques

Let's compare two different techniques from lecture on how to fill in missing data. Recall that imputation should be done with a great deal of caution. Here, the Age variable seems to be missing about 15% of the values. That might be too many to impute. Let's try two methods of imputation on the Age variable:

Split-Impute-Combine in Pandas
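A minimal sketch of split-impute-combine follows. It assumes we split on `Pclass` and `Sex` and fill each gap with the median age of its group; both the grouping variables and the use of the median are illustrative choices, and the rows are made up.

```python
import numpy as np
import pandas as pd

# Made-up rows; two ages are missing.
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex":    ["female", "female", "female", "male", "male", "male"],
    "Age":    [38.0, np.nan, 42.0, 20.0, 24.0, np.nan],
})

# Split on the discrete variables and compute each group's median age.
# transform() returns one value per original row, so the pieces are
# recombined in the original order automatically.
group_median = df.groupby(["Pclass", "Sex"])["Age"].transform("median")

# Impute: only the missing ages are replaced by their group's median.
df["Age_imputed"] = df["Age"].fillna(group_median)

print(df)
```

Keeping the result in a new column, rather than overwriting `Age`, makes it easy to compare the distributions before and after imputation.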

Nearest Neighbor Imputation with Scikit-learn

Now let's try to fill in the Age variable by selecting the 3 nearest data points to the given observation. Here, we can use additional continuous variables in the distance calculation, whereas the split-impute-combine method needs discrete variables to group on.
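One way to sketch this is with scikit-learn's `KNNImputer` and `n_neighbors=3`. The rows below are made up, and using `Pclass` and `Fare` as the extra variables in the distance is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up rows; the last passenger has no recorded age.
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3, 3],
    "Fare":   [80.0, 75.0, 90.0, 8.0, 7.5, 9.0, 8.5],
    "Age":    [40.0, 44.0, 48.0, 20.0, 22.0, 24.0, np.nan],
})

# Distances are computed on the columns that are present in both rows,
# and the missing Age becomes the mean Age of the 3 closest rows.
imputer = KNNImputer(n_neighbors=3)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(imputed)
```

In this toy data the three nearest neighbors of the last row are the other third-class passengers, so its age is filled with their average. Because the distance is Euclidean, a variable with a large range such as `Fare` can dominate it, so scaling the features first is usually worth considering.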

Comparing Imputation Distributions

Now let's see which imputation method changed the overall histogram the least. Do you see anything in the plots below that would favor one method over the other?



Feature Discretization

This is an example of how to turn a continuous feature into an ordinal feature. Let's try to give some human intuition to a variable by grouping the data by age range.

Question: Does age range influence survival rates?
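A minimal sketch of discretization with `pd.cut` is shown below. The bin edges and labels are illustrative choices rather than the ones used in the original notebook, and the rows are made up.

```python
import pandas as pd

# Made-up rows with Titanic-style column names.
df = pd.DataFrame({
    "Age":      [4, 15, 22, 35, 47, 70, 8, 29],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0],
})

# Illustrative bin edges; each interval is open on the left and closed
# on the right, e.g. (0, 12] is labeled "child".
bins = [0, 12, 18, 65, 120]
labels = ["child", "teen", "adult", "senior"]

# pd.cut returns an ordered categorical, so the new feature is ordinal.
df["age_range"] = pd.cut(df["Age"], bins=bins, labels=labels)

# Survival rate within each age range.
survival_rate = df.groupby("age_range", observed=False)["Survived"].mean()

print(df)
print(survival_rate)
```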



Visualization in Python with Pandas, Matplotlib, and Others


Visualizing the dataset

Pandas has plenty of plotting abilities built in. Let's take a look at a few of the different graphing capabilities of Pandas with only matplotlib. Afterward, we can make the visualizations more beautiful.

Visualization Techniques: Distributions

Plot Type Two: Histogram and Kernel Density

Question: What were the ages of people on the Titanic?
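As a sketch of the built-in Pandas plotting interface, the example below draws a density-normalized histogram of ages. The ages are randomly generated for illustration and are not the Titanic values.

```python
import numpy as np
import pandas as pd

# Randomly generated ages for illustration only (not the Titanic data).
rng = np.random.default_rng(0)
ages = pd.Series(rng.normal(30, 12, size=200).clip(0, 80), name="Age")

# Series.plot.hist draws with matplotlib and returns the Axes object,
# which we can keep customizing. density=True rescales the bars so the
# total area is 1, making the result comparable to a density estimate.
ax = ages.plot.hist(bins=20, density=True, alpha=0.6)
ax.set_xlabel("Age")
ax.set_ylabel("Density")
```

A kernel density estimate can be overlaid on the same axes with `ages.plot.kde(ax=ax)`, which additionally requires SciPy to be installed.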

Two-Dimensional Distributions

The above plot is not all that meaningful. We can probably do better than visualizing the joint distribution with 2D histograms. Let's face it: 2D histograms are bound to be sparse and not very descriptive. Instead, let's do something smarter.

Feature Correlation Plot
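One possible starting point for a correlation plot is the pairwise correlation matrix of the numeric features. The sketch below uses made-up rows and only Pandas; which columns to include is an assumption.

```python
import pandas as pd

# Made-up rows with Titanic-style column names.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 1, 3, 2, 3],
    "Fare":     [7.25, 71.28, 53.10, 8.05, 13.00, 7.90],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
})

# Keep only the numeric columns, then compute pairwise Pearson correlations.
corr = df.select_dtypes("number").corr()

print(corr.round(2))
```

With seaborn installed, `seaborn.heatmap(corr, annot=True)` is one way to turn this matrix into a plot.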

Grouped Count Plots

Grouped count plots are used when you have multiple categorical or nominal variables that you want to show together in sub-groups. Grouping here means displaying the counts of the different subgroups in the dataset. For the Titanic data, this can be quite telling.

Question: Do age, gender, or class have an effect on survival?

Plot Type Four: Grouped Bar Chart
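A minimal sketch of a grouped bar chart follows, assuming we count survivors and non-survivors within each class and sex subgroup. The rows are made up, and the choice of `pd.crosstab` is one option among several.

```python
import pandas as pd

# Made-up rows with Titanic-style column names.
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3, 3, 2],
    "Sex":      ["female", "male", "female", "male", "male", "female", "male", "male"],
    "Survived": [1, 0, 1, 0, 0, 1, 0, 0],
})

# Rows are (Pclass, Sex) subgroups; columns are the Survived values 0 and 1.
counts = pd.crosstab([df["Pclass"], df["Sex"]], df["Survived"])

# One cluster of bars per subgroup, one bar per Survived value.
ax = counts.plot(kind="bar", stacked=False)
ax.set_ylabel("Number of passengers")

print(counts)
```

Passing `stacked=True` instead would stack the two bars of each subgroup, which emphasizes subgroup size over the survived and not-survived split.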

Sub-group Distribution Plots

TukeyBoxplot

Plot Type Five: Box Plot

The problem with boxplots is that they might hide important aspects of the distribution. For example, this plot shows several datasets that all have the exact same boxplot.

TukeyBoxplot
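The same point can be checked numerically. The two small samples below are constructed for illustration: their values are spread differently, yet they share the five numbers a box plot is built from (using NumPy's default percentile interpolation).

```python
import numpy as np

# Two constructed samples: one evenly spread, one clumped around 2 and 6.
even = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
clumped = np.array([0, 2, 2, 2, 4, 6, 6, 6, 8])


def five_number_summary(x):
    """Return min, first quartile, median, third quartile, and max."""
    return np.percentile(x, [0, 25, 50, 75, 100])


summary_even = five_number_summary(even)
summary_clumped = five_number_summary(clumped)

print(summary_even)
print(summary_clumped)

# The value counts reveal what the summaries hide.
print(np.unique(even, return_counts=True))
print(np.unique(clumped, return_counts=True))
```

Both samples would produce the same box, median line, and whiskers, which is why plots that show the full shape of the data are often a useful complement.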

Simplifying Plotting with Seaborn

Using pandas and matplotlib is great until you need to redo or make more intricate plots. Let's see about one or two APIs that might simplify our lives. First, let's use Seaborn.

In seaborn, we have access to a number of different plotting tools. Let's take a look at:

Plot Type Six:


Self Test 2a.2



Matrix Plots


New Question: Which passengers are most similar to one another?


Revisiting other Plots in Seaborn

A Final Note on Plotting:

The best plots that you can make are probably ones that are completely custom to the task or question you are trying to solve or answer. These plots are also the most difficult to get right because they take a great deal of iteration, time, and effort to perfect. They also take some time to explain. There is a delicate balance between creating a new plot that answers exactly what you are asking (in the best way possible) and spending an inordinate amount of time on a new plot (when a standard plot might be a "pretty good" answer).



Revisiting with Interactive Visuals: Plotly

More updates to come to this section of the notebook. Plotly is a major step in the direction of using JavaScript and Python together, and I would argue it has a much better implementation than other packages.

Visualizing more than three attributes requires a good deal of thought. In the following graph, let's use interactivity to help bolster the analysis. We will create a graph with custom text overlays that help refine the passenger we are looking at. We will

Check more about using plotly here:

In this notebook you learned:

Want some additional practice? Try to create and use some Bokeh examples that are similar to the plots we created.